Skip to content

Conversation

@sanderegg
Copy link
Member

@sanderegg sanderegg commented Sep 1, 2025

What do these changes do?

This PR changes the behavior of the dv-2 computational scheduler:
BEFORE:

  • if the clusters-keeper is not reachable,
  • if the dask scheduler is not reachable,
  • if rabbitmq is not reachable,
  • --> the pipeline would be directly set to FAILED state

AFTER:

  • --> the pipeline will be set to WAITING_FOR_CLUSTER state
  1. Since when we deploy or play around the clusters-keeper restarts wildly this will prevent failing pipelines straight away which were issues detected a while ago that should not be happening (see fixed issues).
  2. Also, a new constant was defined COMPUTATIONAL_BACKEND_MAX_WAITING_FOR_CLUSTER_TIMEOUT that could be used in case we should change that value (currently defaulted to 10 minutes).

tests driving changes: test_scheduler_dask.py

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 1, 2025
@sanderegg sanderegg self-assigned this Sep 1, 2025
@sanderegg sanderegg added the a:director-v2 issue related with the director-v2 service label Sep 1, 2025
@codecov
Copy link

codecov bot commented Sep 1, 2025

Codecov Report

❌ Patch coverage is 88.70968% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.76%. Comparing base (c8f61c1) to head (6e01185).
⚠️ Report is 1 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (c8f61c1) and HEAD (6e01185). Click for more details.

HEAD has 30 uploads less than BASE
Flag BASE (c8f61c1) HEAD (6e01185)
unittests 31 1
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #8286       +/-   ##
===========================================
- Coverage   86.62%   68.76%   -17.87%     
===========================================
  Files        1940      755     -1185     
  Lines       75306    34787    -40519     
  Branches     1311      175     -1136     
===========================================
- Hits        65237    23921    -41316     
- Misses       9674    10809     +1135     
+ Partials      395       57      -338     
Flag Coverage Δ
integrationtests 64.07% <74.19%> (+<0.01%) ⬆️
unittests 84.58% <88.70%> (-2.14%) ⬇️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 76.95% <ø> (-8.08%) ⬇️
agent ∅ <ø> (∅)
api_server ∅ <ø> (∅)
autoscaling ∅ <ø> (∅)
catalog ∅ <ø> (∅)
clusters_keeper ∅ <ø> (∅)
dask_sidecar ∅ <ø> (∅)
datcore_adapter ∅ <ø> (∅)
director ∅ <ø> (∅)
director_v2 91.04% <88.70%> (+12.80%) ⬆️
dynamic_scheduler ∅ <ø> (∅)
dynamic_sidecar 81.87% <ø> (-8.59%) ⬇️
efs_guardian ∅ <ø> (∅)
invitations ∅ <ø> (∅)
payments ∅ <ø> (∅)
resource_usage_tracker ∅ <ø> (∅)
storage ∅ <ø> (∅)
webclient ∅ <ø> (∅)
webserver 58.89% <ø> (-29.13%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8f61c1...6e01185. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mergify
Copy link
Contributor

mergify bot commented Sep 1, 2025

🧪 CI Insights

Here's what we observed from your CI run for 6e01185.

🟢 All jobs passed!

But CI Insights is watching 👀

@sanderegg sanderegg force-pushed the bugfix/8098/define-timeout-before-failing-pipelines branch from aad5768 to cda0a87 Compare September 2, 2025 09:34
@sanderegg sanderegg changed the title 🐛Do not fail a pipeline directly when computational backend misses one scheduling session 🐛🎨Do not fail a pipeline when the clusters-keeper or the computational backend in general is not reachable Sep 2, 2025
@sanderegg sanderegg force-pushed the bugfix/8098/define-timeout-before-failing-pipelines branch 2 times, most recently from 9c9e448 to f4e5d21 Compare September 2, 2025 14:19
@sanderegg sanderegg requested a review from Copilot September 2, 2025 14:37
@sanderegg sanderegg changed the title 🐛🎨Do not fail a pipeline when the clusters-keeper or the computational backend in general is not reachable 🐛🎨Do not fail a pipeline when the clusters-keeper or the computational backend in general is not reachable for short time Sep 2, 2025
@sanderegg sanderegg changed the title 🐛🎨Do not fail a pipeline when the clusters-keeper or the computational backend in general is not reachable for short time 🐛🎨Do not fail a pipeline when the clusters-keeper or the computational backend in general is not reachable for short time 🚨 Sep 2, 2025
@sanderegg sanderegg marked this pull request as ready for review September 2, 2025 14:41
@sanderegg sanderegg requested a review from pcrespov as a code owner September 2, 2025 14:41
Copy link
Collaborator

@matusdrobuliak66 matusdrobuliak66 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR changes the computational scheduler's behavior when backend services (clusters-keeper, dask scheduler, or rabbitmq) are unreachable. Instead of immediately setting pipelines to FAILED state, they are now set to WAITING_FOR_CLUSTER state, preventing premature failures during service restarts or deployments.

Key changes:

  • Modified exception handling to use WAITING_FOR_CLUSTER state instead of FAILED state for backend connectivity issues
  • Added timeout mechanism to eventually fail pipelines that wait too long for cluster availability
  • Updated test cases to reflect the new behavior and verify the timeout functionality

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test_scheduler_dask.py Updated test to verify new WAITING_FOR_CLUSTER behavior and timeout functionality
conftest.py Added fixture for configuring short cluster wait timeout in tests
dask.py Added return type annotations and removed redundant error logging
_scheduler_base.py Implemented core logic changes for handling backend connectivity issues with WAITING_FOR_CLUSTER state
settings.py Added configuration setting for maximum cluster wait timeout

Copy link
Contributor

@wvangeit wvangeit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks.

Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx

@sanderegg sanderegg force-pushed the bugfix/8098/define-timeout-before-failing-pipelines branch from 4eb52bd to efe65be Compare September 3, 2025 20:31
@sanderegg sanderegg force-pushed the bugfix/8098/define-timeout-before-failing-pipelines branch from 2f4045f to 6e01185 Compare September 4, 2025 06:57
@sanderegg sanderegg added the 🤖-automerge marks PR as ready to be merged for Mergify label Sep 4, 2025
@sanderegg
Copy link
Member Author

@mergify queue

@mergify
Copy link
Contributor

mergify bot commented Sep 4, 2025

queue

🛑 Configuration not compatible with a branch protection setting

The branch protection setting Require branches to be up to date before merging is not compatible with max_parallel_checks>1, queue_conditions != merge_conditions and must be unset.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Sep 4, 2025

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

@sanderegg sanderegg merged commit 1a721f8 into ITISFoundation:master Sep 4, 2025
92 of 95 checks passed
@sanderegg sanderegg deleted the bugfix/8098/define-timeout-before-failing-pipelines branch September 4, 2025 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🤖-automerge marks PR as ready to be merged for Mergify a:computational clusters a:director-v2 issue related with the director-v2 service

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Computations fail when Clusters-keeper service is not reachable During deployment jobs running in the background might be flagged as FAILED

5 participants